R Bootcamp day 3
Statistical Computing Track

George G. Vega Yon


University of Southern California
Department of Preventive Medicine

August 21, 2019

Agenda

  1. High-Performance Computing: An overview

  2. Parallel computing in R

  3. Extended examples

You can download the Rmd file here and the R code here

High-Performance Computing: An overview

Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:

  1. Big data: How to work with data that doesn’t fit your computer

  2. Parallel computing: How to take advantage of multiple core systems

  3. Compiled code: Write your own low-level code (if R doesn't have it yet…)

(Check out the CRAN Task View on HPC)

Big Data

Parallel computing

Flynn's Classical Taxonomy ([Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory](https://computing.llnl.gov/tutorials/parallel_comp/#Whatis))

We will be focusing on the Single Instruction stream, Multiple Data stream (SIMD) model

Some vocabulary for HPC

In raw terms

You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service to create HPC clusters at a low cost (allegedly, since nobody understands how pricing works)

What’s “a core”?

Taxonomy of CPUs (Downloaded from https://slurm.schedmd.com/mc_support.html)

Now, how many cores does your computer have? The parallel package can tell you:

parallel::detectCores()
## [1] 4
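Note that detectCores() counts logical cores (hyper-threads included), which may be more than the number of physical cores; the `logical` argument lets you distinguish the two. A minimal sketch (the output depends on your machine):

```r
# Logical cores (hyper-threads count double on many CPUs)
parallel::detectCores()

# Physical cores only (may return NA on some platforms)
parallel::detectCores(logical = FALSE)
```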

What is parallel computing, anyway?

f <- function(n) n*2
f(1:4)
Here we are using a single core. The function is applied one element at a time, leaving the other 3 cores idle.

What is parallel computing, anyway? (cont’d)

f <- function(n) n*2
f(1:4)
In this smarter way of computing, we take full advantage of our computer by using all 4 cores at the same time. This translates into reduced computation time which, in the case of complicated/long calculations, can be an important speed gain.

GPU vs CPU

[NVIDIA Blog](http://www.nvidia.com/object/what-is-gpu-computing.html)

Let’s think before we start…

When is it a good idea to go HPC?

When is it a good idea?

Ask yourself these questions before jumping into HPC!

Parallel computing in R

While there are several alternatives (just take a look at the High-Performance Computing Task View), we'll focus on the following R packages for explicit parallelism:

Implicit parallelism, on the other hand, refers to out-of-the-box tools that let the programmer not worry about parallelization, e.g. gpuR for matrix manipulation using the GPU, or tensorflow

And there’s also a more advanced set of options

Parallel workflow

(Usually) We do the following:

  1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster)

  2. Copy/prepare each R session (if you are using a PSOCK cluster):

    1. Copy objects with clusterExport

    2. Pass expressions with clusterEvalQ

    3. Set a seed

  3. Do your call: parApply, parLapply, etc.

  4. Stop the cluster with stopCluster
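The four steps above can be sketched end-to-end as follows (the object `x` and the computation are hypothetical, just to show where each piece goes):

```r
library(parallel)

# 1. Create the cluster
cl <- makePSOCKcluster(2L)

# 2. Prepare each R session
x <- 10                           # hypothetical object the workers will need
clusterExport(cl, "x")            # copy -x- to each worker
clusterEvalQ(cl, library(stats))  # evaluate an expression on each worker
clusterSetRNGStream(cl, 123)      # parallel-safe seed

# 3. Make the call
ans <- parSapply(cl, 1:4, function(i) i * x)

# 4. Stop the cluster
stopCluster(cl)
ans
# [1] 10 20 30 40
```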

Types of clusters: PSOCK

Types of clusters: Fork

Take a look at the foreach, Rmpi, and future R packages.
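For instance, the foreach package (with the doParallel backend) wraps this same create/register/call/stop workflow in a loop-like syntax; a minimal sketch, assuming both packages are installed:

```r
library(foreach)
library(doParallel)

cl <- makePSOCKcluster(2L)
registerDoParallel(cl)  # tell %dopar% to use this cluster

# Each iteration may run on a different worker; .combine collects the results
ans <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)

ans
# [1]  1  4  9 16
```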

Let’s get started!

Ex 1: Parallel RNG with makePSOCKcluster

# 1. CREATING A CLUSTER
library(parallel)
nnodes <- 4L
cl     <- makePSOCKcluster(nnodes)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`

# 3. DO YOUR CALL
ans <- parSapply(cl, 1:nnodes, function(x) runif(1e3))
(ans0 <- var(ans))
##               [,1]          [,2]          [,3]          [,4]
## [1,]  0.0861888293 -0.0001633431  5.939143e-04 -3.672845e-04
## [2,] -0.0001633431  0.0853841838  2.390790e-03 -1.462154e-04
## [3,]  0.0005939143  0.0023907904  8.114219e-02 -4.714618e-06
## [4,] -0.0003672845 -0.0001462154 -4.714618e-06  8.467722e-02

Making sure it is reproducible

# I want to get the same!
clusterSetRNGStream(cl, 123)
ans1 <- var(parSapply(cl, 1:nnodes, function(x) runif(1e3)))

# 4. STOP THE CLUSTER
stopCluster(cl)

all.equal(ans0, ans1) # All equal!
## [1] TRUE

Ex 2: Parallel RNG with makeForkCluster

In the case of makeForkCluster

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims  <- 1e3
nnodes <- 4L
cl     <- makeForkCluster(nnodes)    

# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123)

# 3. DO YOUR CALL
ans <- do.call(cbind, parLapply(cl, 1:nnodes, function(x) {
  runif(nsims) # Look! We use the nsims object!
               # This would have failed with makePSOCKcluster
               # had we not copied -nsims- first.
  }))

(ans0 <- var(ans))
##               [,1]          [,2]          [,3]          [,4]
## [1,]  0.0861888293 -0.0001633431  5.939143e-04 -3.672845e-04
## [2,] -0.0001633431  0.0853841838  2.390790e-03 -1.462154e-04
## [3,]  0.0005939143  0.0023907904  8.114219e-02 -4.714618e-06
## [4,] -0.0003672845 -0.0001462154 -4.714618e-06  8.467722e-02

Again, we want to make sure this is reproducible

# Same sequence with same seed
clusterSetRNGStream(cl, 123)
ans1 <- var(do.call(cbind, parLapply(cl, 1:nnodes, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
## [4,]    0    0    0    0
# 4. STOP THE CLUSTER
stopCluster(cl)

Well, if you are a macOS/Linux user, there's a simpler way of doing this…

Ex 3: Parallel RNG with mclapply (Forking on the fly)

In the case of mclapply, the forking (cluster creation) is done on the fly!

# 1. CREATING A CLUSTER
library(parallel)

# The fork cluster will copy the -nsims- object
nsims  <- 1e3
nnodes <- 4L
# cl     <- makeForkCluster(nnodes) # mclapply does it on the fly

# 2. PREPARING THE CLUSTER
set.seed(123) 

# 3. DO YOUR CALL
ans <- do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims)))

(ans0 <- var(ans))
##              [,1]        [,2]         [,3]         [,4]
## [1,]  0.085384184 0.002390790  0.006576204 -0.003998278
## [2,]  0.002390790 0.081142190  0.001846963  0.001476244
## [3,]  0.006576204 0.001846963  0.085175347 -0.002807348
## [4,] -0.003998278 0.001476244 -0.002807348  0.082425477

Once more, we want to make sure this is reproducible

# Same sequence with same seed
set.seed(123) 
ans1 <- var(do.call(cbind, mclapply(1:nnodes, function(x) runif(nsims))))

ans0 - ans1 # A matrix of zeros
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0
## [4,]    0    0    0    0
# 4. STOP THE CLUSTER
# stopCluster(cl) # No need to do this with mclapply
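By default, mclapply spawns getOption("mc.cores", 2L) forked processes; the mc.cores argument sets this explicitly (on Windows it must be 1, since forking is unavailable there). A minimal sketch:

```r
library(parallel)

nsims <- 1e3
set.seed(123)

# mc.cores controls how many forked processes mclapply uses
ans <- do.call(cbind, mclapply(1:4, function(x) runif(nsims), mc.cores = 2L))

dim(ans)
# [1] 1000    4
```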

Ex 4: RcppArmadillo + OpenMP

#include <omp.h>
#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]

using namespace Rcpp;

// [[Rcpp::export]]
arma::mat dist_par(const arma::mat & X, int cores = 1) {
  
  // Some constants
  int N = (int) X.n_rows;
  int K = (int) X.n_cols;
  
  // Output
  arma::mat D(N,N);
  D.zeros(); // Filling with zeros
  
  // Setting the cores
  omp_set_num_threads(cores);
  
#pragma omp parallel for shared(D, N, K, X) default(none)
  for (int i=0; i<N; ++i)
    for (int j=0; j<i; ++j) {
      for (int k=0; k<K; k++) 
        D.at(i,j) += pow(X.at(i,k) - X.at(j,k), 2.0);
      
      // Computing square root
      D.at(i,j) = sqrt(D.at(i,j));
      D.at(j,i) = D.at(i,j);
    }
      
  
  // My nice distance matrix
  return D;
}

# Simulating data
set.seed(1231)
K <- 1000
n <- 500
x <- matrix(rnorm(n*K), ncol=K)

# Are we getting the same?
table(as.matrix(dist(x)) - dist_par(x, 4)) # Only zeros
## 
##      0 
## 250000

# Benchmarking!
rbenchmark::benchmark(
  dist(x),                 # stats::dist
  dist_par(x, cores = 1),  # 1 core
  dist_par(x, cores = 2),  # 2 cores
  dist_par(x, cores = 4),  # 4 cores
  replications = 10, order="elapsed"
)[,1:4]
##                     test replications elapsed relative
## 4 dist_par(x, cores = 4)           10   2.997    1.000
## 3 dist_par(x, cores = 2)           10   3.408    1.137
## 2 dist_par(x, cores = 1)           10   3.563    1.189
## 1                dist(x)           10   4.239    1.414

Ex 5: The future

library(future)
plan(multicore)

# We are creating a global variable
a <- 2

# Creating the futures should only take the overhead (setup) time
# (here they blocked, which can happen when multicore support is unavailable)
system.time({
  x1 %<-% {Sys.sleep(3);a^2}
  x2 %<-% {Sys.sleep(3);a^3}
})
##    user  system elapsed 
##   0.044   0.000   6.047

# Let's wait 3 more seconds to make sure all the futures have resolved
Sys.sleep(3)
system.time({
  print(x1)
  print(x2)
})
## [1] 4
## [1] 8
##    user  system elapsed 
##   0.000   0.000   0.001
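The %<-% operator is shorthand for the explicit future()/value() API, which can be clearer when collecting many results into a list; a minimal sketch, using the sequential plan so it runs anywhere:

```r
library(future)
plan(sequential)  # resolve in the current session; swap for multicore/multisession

f1 <- future(2^2)        # create the futures...
f2 <- future(2^3)

c(value(f1), value(f2))  # ...and block until each value is ready
# [1] 4 8
```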

Ex 6: Simulating \(\pi\)

The R code to do this

pisim <- function(i, nsim) {  # Notice we don't use the -i-
  # Random points
  ans  <- matrix(runif(nsim*2), ncol=2)
  
  # Distance to the origin
  ans  <- sqrt(rowSums(ans^2))
  
  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}
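A single serial run already gives a rough estimate of \(\pi\) up to Monte Carlo error (the definition is repeated here so the chunk is self-contained; the seed is arbitrary):

```r
pisim <- function(i, nsim) {
  ans <- matrix(runif(nsim * 2), ncol = 2)  # random points in the unit square
  ans <- sqrt(rowSums(ans^2))               # distance to the origin
  (sum(ans <= 1) * 4) / nsim                # fraction inside the circle, times 4
}

set.seed(12)
pisim(1, 1e5)  # close to pi = 3.141593, up to Monte Carlo error
```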

Ex 6: Simulating \(\pi\) (cont’d)

# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
##       test replications elapsed relative
## 1 parallel            1   0.417    1.000
## 2   serial            1   1.093    2.621

ans_par <- parSapply(cl, 1:100, pisim, nsim=nsim)
ans_ser <- sapply(1:100, pisim, nsim=nsim)
stopCluster(cl)
# Averaging the estimates and comparing against R's pi constant
c(par = mean(ans_par), ser = mean(ans_ser), R = pi)
##      par      ser        R 
## 3.141126 3.141247 3.141593

Thanks!

gvegayon
@gvegayon
ggvy.cl

Presentation created with rmarkdown::slidy_presentation

See also

For more, check out the CRAN Task View on HPC

Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## Random number generation:
##  RNG:     L'Ecuyer-CMRG 
##  Normal:  Inversion 
##  Sample:  Rejection 
##  
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] future_1.14.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2                rbenchmark_1.0.0         
##  [3] codetools_0.2-16          listenv_0.7.0            
##  [5] digest_0.6.20             magrittr_1.5             
##  [7] evaluate_0.14             icon_0.1.0               
##  [9] highr_0.8                 stringi_1.4.3            
## [11] RcppArmadillo_0.9.600.4.0 rmarkdown_1.14           
## [13] tools_3.6.1               stringr_1.4.0            
## [15] xfun_0.9                  yaml_2.2.0               
## [17] compiler_3.6.1            globals_0.12.4           
## [19] htmltools_0.3.6           knitr_1.24